Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCP] Fix --disk-size for Custom Machine Images #2718

Merged
merged 27 commits into from
Nov 15, 2023
Merged

[GCP] Fix --disk-size for Custom Machine Images #2718

merged 27 commits into from
Nov 15, 2023

Conversation

jackyk02
Copy link
Contributor

@jackyk02 jackyk02 commented Oct 18, 2023

Issue Addressed: #2488

As reported in the issue, it's clear that when launching a GCP cluster with a machine image originating from a 150 GB disk, specifying the --disk-size 200 parameter results in an instance retaining the original 150 GB size.

Based on GCP's documentation on resizing persistent disks, when we create a custom Linux image, we have to manually increase the size of the boot and non-boot disks. To achieve this, I used the Compute Engine API to modify disk sizes within our custom GCP node provider. Comments have been added to the code within the commits to provide additional context into the changes made. Please feel free to comment if further modifications are required.

Custom Image Test:

  1. Launch the custom image with a specified disk size:
    sky launch -c custom-image --cloud gcp --image-id [image id path] --disk-size 200
    sky launch -c custom-image --cloud gcp --image-id [image id path] --disk-size 150

  2. Confirmed the disk size has been increased to 200GB from the original 150GB for the first test, and ensured that users are able to create an instance with the image's original size.

Multi-Node Test:

  1. Launch 8 nodes with a specified disk size:
    sky launch -c my-cluster --num-nodes 8 --image-id [image id path] --cloud gcp --disk-size 200
  2. Verified the disk of each node has been resized to 200GB.

TPU Test:

  1. Launch the custom image with a specified TPU accelerator and disk size:
    sky launch tpu_node.yaml -c custom-image --cloud gcp --image-id [image id path] --disk-size 200, where the YAML file is provided below:

    resources:
      instance_type: n1-highmem-8
      accelerators: tpu-v2-8
      accelerator_args:
        tpu_vm: False
    
  2. Verified that the disk has been resized to 200GB for TPU node.

Note: TPU VMs include a 100GB boot disk that cannot be resized. Users have to add a Persistent Disk to expand their local disk capacity according to the documentation. This requirement may be associated with a feature request at: #2387

@Michaelvll Michaelvll requested a review from cblmemo October 18, 2023 07:34
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing to SkyPilot! This is exciting ; ) Left a suggestion 🙂
Just want to make sure, from your PR description it seems like the resize will happen only after a restart? Is that true? If this is the case, shall we add some CLI hints to notify the user?

sky/skylet/providers/gcp/node.py Outdated Show resolved Hide resolved
@Michaelvll Michaelvll requested a review from cblmemo October 22, 2023 05:51
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt fix! Left several comments 😉

sky/skylet/providers/gcp/node.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Left several comments 👀

sky/skylet/providers/gcp/node.py Show resolved Hide resolved
sky/skylet/providers/gcp/node.py Outdated Show resolved Hide resolved
sky/skylet/providers/gcp/node.py Outdated Show resolved Hide resolved
sky/skylet/providers/gcp/node.py Outdated Show resolved Hide resolved
@jackyk02
Copy link
Contributor Author

jackyk02 commented Nov 6, 2023

Every remaining issue we talked about in the meeting should have been resolved. Please let me know if any additional clarification is needed.

@cblmemo
Copy link
Collaborator

cblmemo commented Nov 6, 2023

Every remaining issue we talked about in the meeting should have been resolved. Please let me know if any additional clarification is needed.

Could you provide all the tests you have done? I don't see any TPUVM test in your PR description 🤔

@jackyk02
Copy link
Contributor Author

jackyk02 commented Nov 7, 2023

Could you provide all the tests you have done? I don't see any TPUVM test in your PR description 🤔

Updated the PR descriptions to include additional tests for GCPTPUNode. As the TPU VMs issue is unrelated to custom machine images (refer to the above description), it would be better to address it in a separate PR. Additionally, another issue discussed in our recent meeting is documented at: #2760.

@cblmemo
Copy link
Collaborator

cblmemo commented Nov 7, 2023

Could you provide all the tests you have done? I don't see any TPUVM test in your PR description 🤔

Updated the PR descriptions to include additional tests for GCPTPUNode. As the TPU VMs issue is unrelated to custom machine images (refer to the above description), it would be better to address it in a separate PR. Additionally, another issue discussed in our recent meeting is documented at: #2760.

If this is the case, what if we specify a custom (machine) image with a size that is more than 100GB to a TPU VM? Will it directly fail? If so, we should remove related code in the GCPTPU class since it will never get executed; also, we should check in resources._try_validate_image_id.

@jackyk02
Copy link
Contributor Author

jackyk02 commented Nov 7, 2023

If this is the case, what if we specify a custom (machine) image with a size that is more than 100GB to a TPU VM? Will it directly fail?

Thanks for the prompt reply! No, it will not fail directly. This issue isn't specifically tied to the custom image itself or #2488. Even when launching a TPU VM with sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 300, the resulting VM is still initialized with a disk size of 100 GB. Users have to add a persistent disk to expand the capacity.

Here is the output of the df -h on the created TPU VM:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        97G   11G   87G  11% /

@cblmemo
Copy link
Collaborator

cblmemo commented Nov 7, 2023

If this is the case, what if we specify a custom (machine) image with a size that is more than 100GB to a TPU VM? Will it directly fail?

Thanks for the prompt reply! No, it will not fail directly. This issue isn't specifically tied to the custom image itself or #2488. Even when launching a TPU VM with sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 300, the resulting VM is still initialized with a disk size of 100 GB. Users have to add a persistent disk to expand the capacity.

Here is the output of the df -h on the created TPU VM:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        97G   11G   87G  11% /

Then at least we should remove related code in the GCPTPU class since it will never get executed? Also, if we doesn't plan to address the problem in this PR, can we file an issue to keep track of this problem?

@jackyk02
Copy link
Contributor Author

jackyk02 commented Nov 7, 2023

Then at least we should remove related code in the GCPTPU class since it will never get executed? Also, if we doesn't plan to address the problem in this PR, can we file an issue to keep track of this problem?

Yes, that's a good point. Regarding the issue tracking, I've already filed it under issue #2387.

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost done..! Left several nits
Also, it would be great to run the smoke test to make sure the PR doesn't break anything ; )

sky/skylet/providers/gcp/node.py Outdated Show resolved Hide resolved
sky/skylet/providers/gcp/node_provider.py Show resolved Hide resolved
sky/skylet/providers/gcp/node.py Outdated Show resolved Hide resolved
sky/skylet/providers/gcp/node.py Outdated Show resolved Hide resolved
sky/skylet/providers/gcp/node.py Outdated Show resolved Hide resolved
@jackyk02
Copy link
Contributor Author

jackyk02 commented Nov 7, 2023

Thank you very much for your contributions to this PR! Let me know if there are any more changes needed.

sky/resources.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one nit! It should be good to go ;)

sky/resources.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work @jackyk02! Left a comment. ; )

sky/resources.py Outdated
@@ -834,13 +834,15 @@ def _try_validate_image_id(self) -> None:
# Check the image exists and get the image size.
# It will raise ValueError if the image does not exist.
image_size = self.cloud.get_image_size(image_id, region)
if image_size > self.disk_size:
if image_size >= self.disk_size:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

= seems unexpected. Do you mean you cannot create a machine with the same size as the image? I think we did try creating image with 256GB and launch a VM from the image with the same disk size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users should be able to create instances with the same size as the image in the latest commit. I’ve updated the PR description to include additional tests for it.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @jackyk02! Just tried this PR:

  1. sky launch -c test-machine-image --cloud gcp "echo hi > hello.txt"
  2. create a machine image for test-machine-image on the dashboard
  3. sky launch -c test-use-machine-image --cloud gcp --image-id projects/xxxx/global/machineIma ges/test-machine-image-resize --disk-size 500

It works well! I am going to merge this PR soon.

@@ -18,7 +18,7 @@ docker:
{%- endif %}

provider:
# We use a custom node provider for GCP to support instance stop and reuse.
# We use a custom node provider for GCP to create, stop and reuse instances.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@Michaelvll Michaelvll merged commit 7d9bcec into skypilot-org:master Nov 15, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants